class: center, middle, inverse, title-slide

.title[
# RNA sequencing
]
.author[
### Haris Khan • 05-Nov-2022
]
.institute[
### Zifo RnD Solutions
]

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ------------ Only edit title, subtitle & author above this ------------ -->

---

# Introduction

## Overview of RNA-seq

* Gene expression studies get a snapshot of the RNA molecules present in a biological system
* Gene expression dictates what cells are doing or what cells are capable of doing
* A basic overview of the main steps in a standard RNA-seq experiment is given below

<br>

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-protocol.png" alt="A typical RNA-seq experiment" width="75%" />
<p class="caption">A typical RNA-seq experiment</p>
</div>

???
* The first step is the extraction and purification of RNA from a sample, followed by an enrichment of target RNAs
  * Most commonly used is poly(A) capture, to select for polyadenylated RNAs
  * Or ribosomal depletion, to deplete ribosomal RNAs that are highly abundant in a cell
* The selected RNAs are then chemically or enzymatically fragmented to molecules of appropriate size (e.g., 300-500 bp)
* Single-stranded target RNAs are reverse-transcribed to cDNA, the RNA is then degraded, and a second strand is synthesized to give double-stranded cDNA
* Adapter sequences are either ligated to the 3' and 5' ends of the double-stranded cDNA or used as primers in the reverse transcription reaction
* The final cDNA library consists of cDNA inserts flanked by an adapter sequence on each end
* In the last step, the cDNA library is amplified by polymerase chain reaction (PCR) using parts of the adapter sequences as primers

---

# Design Aspects of RNA-seq

* Specific aspects to be considered while designing an RNA-seq experiment include:
  * The number of replicates
    * At least three biological replicates per condition are generally recommended for statistical analysis
  * The depth of sequencing
* In many genomic experiments resources are scarce (e.g., material from subjects)
* The first driver of sample size is often budget

---

# RNA-seq Applications

* The popularity of RNA-seq is driven by its large number of applications
* One of the main application areas is gene regulation:
  * Comparison of gene expression between different tissues, cell types, genotypes, stimulation conditions, time points, disease states, growth conditions, and so on
  * The goal of such comparisons is to identify the genes that change in expression to understand the molecular pathways that are used or altered

---

# Alignment and Quantification

## Introduction

* After an experiment has been conducted, the analyst is presented with FASTQ files
* Following sufficient quality control, the next step will either be:
  1. Alignment to a reference genome
  2. Alignment to a reference transcriptome

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-alignment.png" alt="An illustration of spliced alignment of RNA-seq fragments" width="60%" />
<p class="caption">An illustration of spliced alignment of RNA-seq fragments</p>
</div>

---

# Alignment and Quantification

## Spliced alignment to a reference genome

* A popular solution for handling RNA-seq alignments is to use a splice-aware aligner
* Popular splice-aware aligners include [STAR](https://github.com/alexdobin/STAR) and [HISAT2](http://daehwankimlab.github.io/hisat2/)

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/genome-alignment.png" alt="Spliced alignment of RNA-seq fragments to a genome" width="80%" />
<p class="caption">Spliced alignment of RNA-seq fragments to a genome</p>
</div>

---

# Alignment and Quantification

## Unspliced alignment to a reference transcriptome

* An alternative to splice-aware genome alignment is direct transcriptome alignment
* Direct transcriptome alignment consists of aligning against a set of known transcripts
* Popular transcriptome aligners include [kallisto](https://pachterlab.github.io/kallisto/) and [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html)

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/transcriptome-alignment.png" alt="Unspliced alignment of RNA-seq fragments to a transcriptome" width="50%" />
<p class="caption">Unspliced alignment of RNA-seq fragments to a transcriptome</p>
</div>

---

# Alignment and Quantification

## Gene- and Transcript-Level Quantification From RNA-seq Data

* One of the main uses of RNA-seq is to assess gene- and transcript-level abundances
* Most commonly, abundances are estimated at the level of genes
* Transcript-level abundances are becoming more widely used

---

# Alignment and Quantification

## Gene- and Transcript-Level Quantification From RNA-seq Data

* Gene-level quantification consists of assigning reads to genes
* A gene consists of all transcripts produced from a specific strand at a specific locus
* The total expression of a gene is the sum of the expression of its isoforms
* Popular stand-alone read counting tools include [featureCounts](http://subread.sourceforge.net) and [HTSeq](https://htseq.readthedocs.io/en/master/)

---

# Alignment and Quantification

## Transcript Quantification

* Transcript-level quantification consists of assigning reads to individual transcripts (isoforms)
* Isoforms of the same gene share exons, so many reads are compatible with more than one transcript
* Such ambiguous reads are typically assigned probabilistically rather than counted directly
* Popular transcript quantification tools include [kallisto](https://pachterlab.github.io/kallisto/) and [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html)

---

# Differential expression

## Overview

* Following alignment and quantification, the next step is testing for differential expression (DE)
* The starting point for DE is often a count table:
  * Rows represent genomic features (e.g., genes)
  * Columns represent samples (i.e., experimental units)
* The goal of DE is to identify genes which are differentially expressed between conditions

---

# Differential expression

## Workflow

.pull-left-50[

* [DESeq2](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html): The DESeq2 package provides methods to test for differential expression using negative binomial generalized linear models.
* [edgeR](https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf): Also a commonly used workflow, but it differs from DESeq2 in its normalisation strategy. edgeR applies the Trimmed Mean of M-values (TMM), a weighted trimmed mean of log ratios, whereas DESeq2 uses a median-of-ratios method based on per-gene geometric means.
* **Important Takeaway!** No statistical modelling can fully capture biological phenomena.
Statistical methods rely on assumptions and requirements that are only partially satisfied. It is important to understand the basics of statistical models to know which one works best on your data!

]

.pull-right-50[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/de-analysis-overview.png" alt="Schematic of a DE analysis for RNA-seq data" width="100%" />
<p class="caption">Schematic of a DE analysis for RNA-seq data</p>
</div>

]

---

# Differential expression

## Filtering

* Genes with very low counts across all libraries provide little evidence for DE
* From a biological point of view:
  * A gene must be expressed at some minimal level before it is likely to be translated into a protein or to be biologically important
* From a statistical point of view:
  * The more inferences are made, the more likely erroneous inferences become
  * The expression level of very low-count genes is indistinguishable from technical noise
* These genes should be filtered out prior to further analysis:
  * As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions
  * Users can set their own definition of genes being expressed
  * Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library

---

# Differential expression

## Normalization

.pull-left-50[

* The observed counts of the genes cannot be directly compared across samples since there are differences in sequencing depth across libraries
* Several strategies have been developed to normalize counts to facilitate cross-sample comparisons:
  * **Library Size**: One source of variation between samples is the difference in library size, where library size is the total number of reads generated for a given sample.
  * **Gene Length**: Longer genes will inevitably have higher read counts than shorter genes expressed at the same level, simply because of the difference in gene length.
  * **Across Sample**: Across-sample normalization methods correct for other technical artifacts (batch effects, for example) to improve data quality and the ability to detect biologically relevant genes.

]

.pull-right-50[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/normalization.png" alt="Gene Expression Normalization Methods" width="100%" />
<p class="caption">Gene Expression Normalization Methods</p>
</div>

]

---

# Differential expression

## Modeling and estimation

.pull-left-50[

* Even with extensive datasets available, the quantitative understanding of gene regulation is far from comprehensive, since the available data usually give only an average of many cell states or a few snapshots of dynamic systems.
* Obtaining a complete operational picture using solely experimental approaches is challenging. Mathematical modeling provides an alternative path for this key problem, offering new approaches that incorporate detailed dynamics of sets of biochemical interactions.
* Differential expression analysis tools therefore apply different modeling techniques. DESeq2, for example, estimates **size factors** and **gene-wise dispersions** to fit a negative binomial generalized linear model for each gene.
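* As a rough illustration of what a size factor is, the sketch below computes median-of-ratios size factors in plain Python. It is not DESeq2's actual implementation; the function names `size_factors` and `normalize` and the toy count table are made up for this example.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count table."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts:
        # A gene with a zero count in any sample has a geometric mean of
        # zero and cannot serve as a reference, so it is skipped.
        if min(gene_counts) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        geo_mean = math.exp(log_geo_mean)
        for j, c in enumerate(gene_counts):
            ratios[j].append(c / geo_mean)
    # The size factor of a sample is the median ratio over retained genes.
    return [statistics.median(r) for r in ratios]


def normalize(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(gene_counts, factors)]
            for gene_counts in counts]


# Made-up toy table: rows are genes, columns are two samples.
# Sample 2 is sequenced about twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped when estimating size factors
]

factors = size_factors(toy_counts)
normalized = normalize(toy_counts, factors)
print(factors)
print(normalized[0])
```

In this toy table the second size factor comes out at about twice the first, so dividing by the factors puts the first three genes on the same scale in both samples.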
]

.pull-right-50[

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/de_theory.png" alt="Gene Expression modeling and estimation" width="100%" />
<p class="caption">Gene Expression modeling and estimation</p>
</div>

]

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-experiment.png" alt="A typical RNA-seq experiment" width="60%" />
<p class="caption">A typical RNA-seq experiment</p>
</div>

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-analysis-roadmap.png" alt="A generic roadmap for RNA-seq computational analyses" width="2596" />
<p class="caption">A generic roadmap for RNA-seq computational analyses</p>
</div>

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-analysis-strategies.png" alt="Read mapping and transcript identification strategies" width="1891" />
<p class="caption">Read mapping and transcript identification strategies</p>
</div>

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-common-tools.png" alt="Common software tools in use for differential gene expression analysis using RNA-seq data" width="2835" />
<p class="caption">Common software tools in use for differential gene expression analysis using RNA-seq data</p>
</div>

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-sequencing-technology.png" alt="An overview is shown of the three main sequencing technologies for RNA-seq" width="100%" />
<p class="caption">An overview is shown of the three main sequencing technologies for RNA-seq</p>
</div>

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-analysis-technology.png" alt="Comparison of short-read, long-read and direct RNA-seq analysis" width="85%" />
<p class="caption">Comparison of short-read, long-read and direct RNA-seq analysis</p>
</div>

---

# RNA-seq analysis

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/rnaseq/images/rnaseq-analysis-workflow.png" alt="RNA-seq data analysis workflow for differential gene expression" width="100%" />
<p class="caption">RNA-seq data analysis workflow for differential gene expression</p>
</div>

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end_slide
class: end-slide, middle
count: false

# Thank you. Questions?

.end-text[
<p class="smaller">
<span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br>
Created: 05-Nov-2022 • James Ashmore • <a href="https://www.zifornd.com/category/omics-bioinformatics">Bioinformatics</a> • <a href="https://www.zifornd.com">Zifo</a>
</p>
]